Bag-of-visual-words expansion using visual relatedness for video indexing, SIGIR ’08
Bag-of-visual-words (BoW) has been popular for visual classification in recent years. In this paper, we propose a novel BoW expansion method to alleviate the visual-word correlation problem. We achieve this by diffusing the weights of visual words in BoW according to visual-word relatedness, which is rigorously defined within a visual ontology. The proposed method is tested in video indexing experiments on the TRECVID-2006 video retrieval benchmark, where an improvement of 7% over the traditional BoW is reported.
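The core diffusion step can be sketched in a few lines. The function name, the toy three-word vocabulary, and the row normalization below are our own illustration (the paper defines relatedness via a visual ontology), assuming a simple linear spread of each word's weight to its related words:

```python
import numpy as np

def expand_bow(hist, relatedness):
    """Diffuse BoW weights through a visual-word relatedness matrix.

    hist        : (V,) raw visual-word counts for one image/keyframe
    relatedness : (V, V) pairwise word relatedness, 1.0 on the diagonal
    """
    # Row-normalize so each word distributes one unit of weight in total.
    R = relatedness / relatedness.sum(axis=1, keepdims=True)
    # Each word passes part of its weight to the words related to it.
    return R.T @ hist

# Toy vocabulary of 3 visual words; words 0 and 1 are related.
rel = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
h = np.array([2.0, 0.0, 1.0])
h_exp = expand_bow(h, rel)  # word 1 now gets weight via its relation to word 0
```

Note that total histogram mass is conserved by the row normalization, so the expansion only redistributes weight rather than inflating it.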
Structuring lecture videos for distance learning applications, ISMSE
This paper presents a novel automatic approach to structuring and indexing lecture videos for distance learning applications. By structuring video content, we can support both topic indexing and semantic querying of multimedia documents. In this paper, our aim is to link the discussion topics extracted from the electronic slides with their associated video and audio segments. The two major techniques in our approach are video text analysis and speech recognition. Initially, a video is partitioned into shots based on slide transitions. For each shot, the embedded video texts are detected, reconstructed, and segmented as high-resolution foreground texts for recognition by commercial OCR software. The recognized texts can then be matched with their associated slides for video indexing. Meanwhile, both phrases (titles) and keywords (content) are extracted from the electronic slides to spot the speech signals. The spotted phrases and keywords are further utilized as queries to retrieve the most similar slide for speech indexing.
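The text-to-slide matching step can be illustrated with a toy keyword-overlap matcher; the function and the scoring rule are hypothetical stand-ins for the paper's actual matching, and the example slides are invented:

```python
def match_slide(ocr_text, slides):
    """Match recognized shot text to the electronic slide with the
    largest fraction of overlapping keywords (a toy stand-in)."""
    words = set(ocr_text.lower().split())

    def overlap(slide):
        slide_words = set(slide.lower().split())
        return len(words & slide_words) / max(len(slide_words), 1)

    # Return the index of the best-matching slide.
    return max(range(len(slides)), key=lambda i: overlap(slides[i]))

slides = ["introduction to neural networks",
          "backpropagation algorithm details",
          "course logistics and grading"]
idx = match_slide("the Backpropagation Algorithm", slides)
```

A real system would weight rarer keywords more heavily (e.g., TF-IDF) so that common words do not dominate the match.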
Exploring Object Relation in Mean Teacher for Cross-Domain Detection
Rendering synthetic data (e.g., 3D CAD-rendered images) to generate
annotations for learning deep models in vision tasks has attracted increasing
attention in recent years. However, simply applying the models learnt on
synthetic images may lead to high generalization error on real images due to
domain shift. To address this issue, recent progress in cross-domain
recognition has featured the Mean Teacher, which directly simulates
unsupervised domain adaptation as semi-supervised learning. The domain gap is
thus naturally bridged with consistency regularization in a teacher-student
scheme. In this work, we advance this Mean Teacher paradigm to be applicable
for cross-domain detection. Specifically, we present Mean Teacher with Object
Relations (MTOR), which remolds Mean Teacher in a novel way under the backbone of Faster
R-CNN by integrating the object relations into the measure of consistency cost
between teacher and student modules. Technically, MTOR firstly learns
relational graphs that capture similarities between pairs of regions for
teacher and student respectively. The whole architecture is then optimized with
three consistency regularizations: 1) region-level consistency to align the
region-level predictions between teacher and student, 2) inter-graph
consistency for matching the graph structures between teacher and student, and
3) intra-graph consistency to enhance the similarity between regions of the same
class within the student's graph. Extensive experiments are conducted on the
transfers across Cityscapes, Foggy Cityscapes, and SIM10k, and superior results
are reported when compared to state-of-the-art approaches. More remarkably, we
obtain a new single-model record of 22.8% mAP on the Syn2Real detection
dataset.
Comment: CVPR 2019; the code and model of our MTOR are publicly available at:
https://github.com/caiqi/mean-teacher-cross-domain-detectio
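The three consistency terms can be sketched roughly as follows. The function names, the cosine-similarity relational graph, and the unweighted sum are our simplifications of the paper's formulation, not its exact losses:

```python
import numpy as np

def cosine_graph(feats):
    """Relational graph: pairwise cosine similarity between region features."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return unit @ unit.T

def mtor_consistency(t_probs, s_probs, t_feats, s_feats, t_labels):
    """Sketch of the three MTOR-style consistency terms.

    t_* / s_* : teacher / student region class probabilities (N, C)
                and region features (N, D); t_labels are teacher
                pseudo-labels used for the intra-graph term.
    """
    g_t, g_s = cosine_graph(t_feats), cosine_graph(s_feats)
    # 1) region-level: align per-region class predictions.
    region = np.mean((t_probs - s_probs) ** 2)
    # 2) inter-graph: match teacher and student graph structures.
    inter = np.mean((g_t - g_s) ** 2)
    # 3) intra-graph: pull together student regions of the same class.
    same_class = t_labels[:, None] == t_labels[None, :]
    intra = np.mean(1.0 - g_s[same_class])
    return region + inter + intra

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
probs = rng.random(size=(4, 3))
labels = np.array([0, 0, 1, 1])
# With teacher == student, only the intra-graph term can be non-zero.
loss = mtor_consistency(probs, probs, feats, feats, labels)
```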
Long-term Leap Attention, Short-term Periodic Shift for Video Classification
A video transformer naturally incurs a heavier computation burden than a static
vision transformer, as the former processes a far longer sequence than the
latter under the current attention of quadratic complexity. Existing works
treat the temporal axis as a simple extension of the spatial axes, focusing on
shortening the spatio-temporal sequence by either generic pooling or local
windowing, without utilizing temporal redundancy.
However, videos naturally contain redundant information between neighboring
frames; thereby, we could potentially suppress attention on visually similar
frames in a dilated manner. Based on this hypothesis, we propose LAPS, a
long-term "Leap Attention" (LA) and short-term "Periodic Shift" (P-Shift)
module for video transformers, with reduced complexity. Specifically, LA groups
long-term frames into pairs, then refactors each discrete pair via attention.
P-Shift exchanges features between temporal neighbors to confront the loss of
short-term dynamics. By replacing vanilla 2D attention with LAPS, we can adapt
a static transformer into a video one with zero extra parameters and negligible
computation overhead (2.6%).
Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS
transformer achieves competitive performance in terms of accuracy, FLOPs, and
parameters among CNN and transformer SOTAs. We open-source our project at
https://github.com/VideoNetworks/LAPS-transformer .
Comment: Accepted by ACM Multimedia 2022, 10 pages, 4 figures
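The two components can be caricatured in a few lines of NumPy. The chunk sizes, the dilated pairing rule, and the function names below are our guesses at the flavor of the method, not the paper's implementation:

```python
import numpy as np

def periodic_shift(x, fold=2):
    """Short-term Periodic Shift sketch (TSM-style, our simplification):
    swap one 1/fold slice of channels with each temporal neighbor.

    x : (T, C) per-frame feature vectors.
    """
    T, C = x.shape
    f = C // fold
    out = x.copy()
    out[1:, :f] = x[:-1, :f]            # one channel chunk moves forward in time
    out[:-1, f:2 * f] = x[1:, f:2 * f]  # another chunk moves backward
    return out

def leap_pairs(T, leap=2):
    """Long-term Leap Attention sketch: group frames into dilated pairs
    (i, i + leap) so attention runs over pairs, not the full sequence."""
    return [(i, i + leap) for i in range(T - leap)]

x = np.arange(12.0).reshape(3, 4)  # T=3 frames, C=4 channels
y = periodic_shift(x, fold=2)      # frame 1 now carries frame 0's first chunk
pairs = leap_pairs(4, leap=2)      # [(0, 2), (1, 3)]
```

The shift costs zero parameters, which is consistent with the abstract's claim of adapting a static transformer without extra weights.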
Fusing semantics, observability, reliability and diversity of concept detectors for video search
Effective utilization of semantic concept detectors for large-scale video search has recently become a topic of intensive study. One of the main challenges is the selection and fusion of appropriate detectors, which considers not only the semantics but also the reliability, observability, and diversity of detectors in target video domains. In this paper, we present a novel fusion technique which considers these different aspects of detectors for query answering. In addition to utilizing detectors for bridging the semantic gap between user queries and multimedia data, we also address the issue of the "observability gap" among detectors, which cannot be directly inferred from semantic reasoning such as using an ontology. To facilitate the selection of detectors, we propose building two vector spaces: a semantic space (SS) and an observability space (OS). We categorize the detectors selected separately from SS and OS into four types: anchor, bridge, positive, and negative concepts. A multi-level fusion strategy is proposed to combine detectors in a novel way, allowing the enhancement of detector reliability while enabling the observability, semantics, and diversity of concepts to be utilized for query answering. By experimenting with the proposed approach on TRECVID 2005-2007 datasets and queries, we demonstrate the significance of considering observability, reliability, and diversity, in addition to the semantics of detectors with respect to queries.
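Selection over the two vector spaces might look roughly like this. The equal weighting, the cosine scoring, and the toy detector vectors are our assumptions, and the four-type categorization is omitted for brevity:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_detectors(query_ss, query_os, det_ss, det_os, names, k=2):
    """Rank detectors by combined semantic-space (SS) and
    observability-space (OS) similarity to a query; a crude stand-in
    for the paper's selection, with equal weights for both spaces."""
    scores = {n: 0.5 * cos(query_ss, det_ss[n]) + 0.5 * cos(query_os, det_os[n])
              for n in names}
    return sorted(scores, key=scores.get, reverse=True)[:k]

names = ["car", "person", "road"]
det_ss = {"car": np.array([1.0, 0.0]),
          "person": np.array([0.0, 1.0]),
          "road": np.array([0.7, 0.7])}
det_os = det_ss  # toy example: same layout in both spaces
top = select_detectors(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                       det_ss, det_os, names, k=2)
```

In the paper's framing, a detector scoring high in SS but low in OS (or vice versa) would fall into a different category than one scoring high in both; this sketch only produces a single fused ranking.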
On the Selection of Anchors and Targets for Video Hyperlinking
A problem not well understood in video hyperlinking is what qualifies a
fragment as an anchor or target. Ideally, anchors provide good starting points
for navigation, and targets supplement anchors with additional details while
not distracting users with irrelevant, false, or redundant information. The
problem is not trivial due to the intertwining relationship between data
characteristics and user expectation. Imagine that in a large dataset, there
are clusters of fragments spreading over the feature space. The nature of each
cluster can be described by its size (implying popularity) and structure
(implying complexity). A principled way of hyperlinking can be carried out by
picking centers of clusters as anchors and from there reach out to targets
within or outside of clusters with consideration of neighborhood complexity.
The question is which fragments should be selected either as anchors or
targets, in one way to reflect the rich content of a dataset, and meanwhile to
minimize the risk of frustrating user experience. This paper provides some
insights to this question from the perspective of hubness and local intrinsic
dimensionality, which are two statistical properties in assessing the
popularity and complexity of a data space. Based on these properties, two novel
algorithms are proposed for low-risk automatic selection of anchors and
targets.
Comment: ACM International Conference on Multimedia Retrieval (ICMR), 2017. (Oral)
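The hubness side of the argument can be illustrated with the classical k-occurrence count; the toy data below (our construction) shows why a cluster center makes a better anchor candidate than an outlier:

```python
import numpy as np

def hubness_scores(X, k=3):
    """k-occurrence hubness: how often each point appears in the
    k-nearest-neighbour lists of the other points (self excluded).
    High scores mark 'hubs' -- natural anchor candidates."""
    # Pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)            # a point is never its own neighbour
    knn = np.argsort(D, axis=1)[:, :k]     # each row: indices of its k-NN
    return np.bincount(knn.ravel(), minlength=len(X))

rng = np.random.default_rng(1)
# A tight cluster of 6 fragments around the origin plus one distant outlier.
X = np.vstack([rng.normal(0.0, 0.1, size=(6, 2)), [[5.0, 5.0]]])
scores = hubness_scores(X, k=2)  # the outlier (index 6) is in nobody's k-NN list
```

Local intrinsic dimensionality, the second property the paper uses, would additionally characterize how complex the neighbourhood around each candidate is; it is not sketched here.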
Incremental Learning on Food Instance Segmentation
Food instance segmentation is essential to estimate the serving size of
dishes in a food image. The recent cutting-edge techniques for instance
segmentation are deep learning networks with impressive segmentation quality
and fast computation. Nonetheless, they are data-hungry and expensive to
annotate. This paper proposes an incremental learning framework to optimize
the model performance given a limited data labelling budget. The power of the
framework is a novel difficulty assessment model, which forecasts how
challenging an unlabelled sample is to the latest trained instance segmentation
model. The data collection procedure is divided into several stages, in each of
which a new batch of samples is collected. The framework allocates the labelling
budget to the most difficult samples. The unlabelled samples that meet a
certain qualification from the assessment model are used to generate
pseudo-labels. Eventually, the manual labels and pseudo-labels are added to the
training data to improve the instance segmentation model. On four large-scale
food datasets, our proposed framework outperforms current incremental learning
benchmarks and achieves competitive performance with the model trained on fully
annotated samples.
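The budget-allocation loop can be sketched as follows; the difficulty scores, the pseudo-label threshold, and the function name are hypothetical (the paper learns difficulty with a dedicated assessment model rather than taking it as given):

```python
def allocate_labels(difficulty, budget, easy_thresh=0.2):
    """Split unlabelled samples into manual-label and pseudo-label sets.

    difficulty : dict sample_id -> predicted difficulty in [0, 1]
    budget     : number of samples we can afford to annotate manually
    """
    ranked = sorted(difficulty, key=difficulty.get, reverse=True)
    manual = ranked[:budget]                    # hardest samples get human labels
    pseudo = [s for s in ranked[budget:]
              if difficulty[s] <= easy_thresh]  # easiest qualify for pseudo-labels
    return manual, pseudo

# One stage of the procedure with a budget of 1 manual annotation.
diff = {"a": 0.9, "b": 0.6, "c": 0.15, "d": 0.05}
manual, pseudo = allocate_labels(diff, budget=1)
```

Samples that are neither hard enough to justify manual labelling nor easy enough to pseudo-label (here "b") simply wait for a later stage, which matches the staged collection the abstract describes.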